The data comes from the Lifelines cohort. Lifelines is a large, multi-generational, prospective cohort study that includes over 167,000 participants (10%) from the northern population of the Netherlands. Within this cohort study the participants are followed over a 30-year period. Every five years, participants visit one of the Lifelines sites in the northern parts of the Netherlands for an assessment. During these visits, several physical measurements are taken and different biomaterials are collected. As part of the assessment, participants are asked to fill out comprehensive questionnaires. In between assessments, participants are invited to complete follow-up questionnaires approximately once every 1.5 years.
The data is aggregated on age groups in order to preserve the privacy of participants.The data is offered in two variants: complete and continuous. The complete variant contains all respondents that participated at one or both time points (T1 or T2).
For the research question, the change over time for a single group of participants was not of interest, so the complete data set was used. In this assignment, two parameters are selected and investigated to see if there is a relation.
Is there a relation between lower education level and osteoarthritis in the lifelines health dataset?
For the public dataset the percentage of participants whose highest education is ‘lower education’ is shown.
Lower education includes:
Osteoarthritis, or joint degradation, is measured as the percentage of participants that suffer from this condition.
More information on the cohurt study can be found on the lifelines wiki and the Toolkit - Aletta Year Challenge 2022.pdf
https://wiki-lifelines.web.rug.nl/doku.php?id=cohort
https://wiki-lifelines.web.rug.nl/doku.php?id=1a
import yaml
import pandas as pd
import numpy as np
#Statistics
import statsmodels.api as sm
from scipy.stats import shapiro
# Visualization:
import seaborn as sns
import hvplot.pandas
import matplotlib.pyplot as plt
# Bokeh
from bokeh.io import output_notebook
from bokeh.plotting import figure, output_file, show
from bokeh.models import (BasicTicker, ColorBar, ColumnDataSource,
LinearColorMapper, PrintfTickFormatter, HoverTool)
from bokeh.transform import transform, dodge, factor_cmap, factor_mark
from bokeh.palettes import Viridis256
output_notebook()
# Panel
import panel as pn
pn.extension()
/usr/local/lib/python3.9/dist-packages/pandas/core/computation/expressions.py:20: UserWarning: Pandas requires version '2.7.3' or newer of 'numexpr' (version '2.7.2' currently installed). from pandas.core.computation.check import NUMEXPR_INSTALLED /usr/local/lib/python3.9/dist-packages/pandas/core/arrays/masked.py:59: UserWarning: Pandas requires version '1.3.2' or newer of 'bottleneck' (version '1.2.1' currently installed). from pandas.core import (
Necessary functions:
from Summary_functions_and_code_clean_v11 import DS_Q_Q_Hist, DS_Q_Q_Plot
with open("config.yaml", 'r') as stream:
config = yaml.safe_load(stream)
age = config['lifelines_age']
# Name the dataframes
# Encoding utf_16 is necessary to read the table
age = pd.read_table(age, encoding='utf_16')
# Inspect dataframe data aggregated on age
age.head()
| AGE | GENDER | GROUP_SIZE_CAT | AGE_T1 | AGE_T2 | BMI_T1 | WEIGHT_T1 | HIP_T1 | WAIST_T1 | HEIGHT_T1 | ... | V_SUM_T1 | D_SUM_T1 | LTE_SUM_T1 | LDI_SUM_T1 | LTE_SUM_T2 | LDI_SUM_T2 | NSES_YEAR | NSES | MENTAL_DISORDER_T1 | MENTAL_DISORDER_T2 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 25 | 1 | 2 | 17,15 | 20,00 | 21,53 | 72,20 | 89,85 | 78,03 | 182,96 | ... | 17,85 | 26,57 | 0,97 | 2,23 | 0,57 | 2,30 | 2010,00 | -0,67 | 0,33 | 1,07 |
| 1 | 25 | 2 | 4 | 17,20 | 19,96 | 22,27 | 64,91 | 91,80 | 75,34 | 170,67 | ... | 20,21 | 26,35 | 1,01 | 3,30 | 0,66 | 3,37 | 2010,00 | -0,67 | 1,40 | 1,87 |
| 2 | 26 | 1 | 3 | 17,89 | 21,06 | 22,15 | 74,92 | 91,49 | 79,89 | 183,78 | ... | 17,96 | 26,16 | 1,00 | 2,32 | 0,51 | 2,46 | 2010,00 | -0,66 | 0,41 | 2,11 |
| 3 | 26 | 2 | 5 | 18,08 | 21,07 | 22,57 | 65,84 | 92,87 | 76,39 | 170,78 | ... | 20,49 | 26,63 | 1,12 | 3,43 | 0,81 | 3,72 | 2010,00 | -0,69 | 1,60 | 3,20 |
| 4 | 27 | 1 | 3 | 18,74 | 21,84 | 22,57 | 75,66 | 92,13 | 80,82 | 182,94 | ... | 18,38 | 26,37 | 1,03 | 2,51 | 0,79 | 2,41 | 2010,00 | -0,86 | 1,11 | 1,76 |
5 rows × 72 columns
age.tail()
| AGE | GENDER | GROUP_SIZE_CAT | AGE_T1 | AGE_T2 | BMI_T1 | WEIGHT_T1 | HIP_T1 | WAIST_T1 | HEIGHT_T1 | ... | V_SUM_T1 | D_SUM_T1 | LTE_SUM_T1 | LDI_SUM_T1 | LTE_SUM_T2 | LDI_SUM_T2 | NSES_YEAR | NSES | MENTAL_DISORDER_T1 | MENTAL_DISORDER_T2 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 117 | 83 | 2 | 2 | 73,70 | 77,37 | 27,54 | 73,47 | 103,96 | 93,38 | 163,34 | ... | 21,00 | 27,33 | 1,09 | 0,72 | 0,77 | 0,61 | 2009,36 | -0,87 | 1,68 | 1,82 |
| 118 | 84 | 1 | 1 | 74,81 | 78,12 | 26,98 | 84,14 | 100,81 | 100,53 | 176,55 | ... | 23,00 | 31,00 | 0,87 | 0,64 | 0,83 | 0,43 | 2009,26 | -0,79 | 1,04 | 0,30 |
| 119 | 84 | 2 | 2 | 74,74 | 78,32 | 28,16 | 75,62 | 105,17 | 95,49 | 163,86 | ... | 1,19 | 0,88 | 0,87 | 0,59 | 2009,28 | -0,77 | 1,63 | 1,00 | ||
| 120 | 85 | 1 | 1 | 75,85 | 79,37 | 26,75 | 83,09 | 100,36 | 100,25 | 176,34 | ... | 1,01 | 0,86 | 0,91 | 0,54 | 2009,50 | -0,74 | 1,07 | 1,16 | ||
| 121 | 85 | 2 | 1 | 76,05 | 79,44 | 27,42 | 72,83 | 104,52 | 92,67 | 162,94 | ... | 1,15 | 0,83 | 0,88 | 0,69 | 2009,42 | -0,79 | 0,97 | 2,12 |
5 rows × 72 columns
# There seems to be missing data
# Check info of dataframe for dtype, missing values, columns
age.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 122 entries, 0 to 121 Data columns (total 72 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 AGE 122 non-null int64 1 GENDER 122 non-null int64 2 GROUP_SIZE_CAT 122 non-null int64 3 AGE_T1 122 non-null object 4 AGE_T2 122 non-null object 5 BMI_T1 122 non-null object 6 WEIGHT_T1 122 non-null object 7 HIP_T1 122 non-null object 8 WAIST_T1 122 non-null object 9 HEIGHT_T1 122 non-null object 10 BMI_T2 122 non-null object 11 WEIGHT_T2 122 non-null object 12 HIP_T2 122 non-null object 13 WAIST_T2 122 non-null object 14 HEIGHT_T2 122 non-null object 15 EDUCATION_LOWER_T1 122 non-null object 16 EDUCATION_LOWER_T2 122 non-null object 17 WORK_T1 122 non-null object 18 WORK_T2 122 non-null object 19 LOW_QUALITY_OF_LIFE_T1 122 non-null object 20 LOW_QUALITY_OF_LIFE_T2 122 non-null object 21 DBP_T1 122 non-null object 22 DBP_T2 122 non-null object 23 HBF_T1 122 non-null object 24 HBF_T2 122 non-null object 25 MAP_T1 122 non-null object 26 MAP_T2 122 non-null object 27 SBP_T1 122 non-null object 28 SBP_T2 122 non-null object 29 HTN_MED_T1 122 non-null object 30 CHO_T1 122 non-null object 31 GLU_T1 122 non-null object 32 CHO_T2 122 non-null object 33 GLU_T2 122 non-null object 34 RESPIRATORY_DISEASE_T1 122 non-null object 35 PACKYEARS 122 non-null object 36 SMOKING 122 non-null object 37 METABOLIC_DISORDER_T1 122 non-null object 38 METABOLIC_DISORDER_T2 122 non-null object 39 LLDS 122 non-null object 40 Quintile_LLDS 122 non-null object 41 ALCOHOL_INTAKE_T1 122 non-null object 42 KCAL_INTAKE_T1 122 non-null object 43 MWK_VAL 122 non-null object 44 SCOR_VAL 122 non-null object 45 MWK_NO_VAL 122 non-null object 46 SCOR_NO_VAL 122 non-null object 47 SPORTS_T1 122 non-null object 48 CYCLE_COMMUTE_T1 122 non-null object 49 VOLUNTEER_T1 122 non-null object 50 PREGNANCIES 122 non-null object 51 OSTEOARTHRITIS_T1 122 non-null object 52 SLEEP_QUALITY 122 non-null object 53 DIAG_CFS_CDC 122 non-null object 54 DIAG_FIBROMYALGIA_ACR 122 non-null object 55 DIAG_IBS_ROME3 122 non-null object 56 C_SUM_T1 122 non-null object 57 A_SUM_T1 122 non-null object 58 SC_SUM_T1 122 non-null object 59 I_SUM_T1 122 non-null object 60 E_SUM_T1 122 non-null object 61 SD_SUM_T1 122 non-null object 62 V_SUM_T1 122 non-null object 63 D_SUM_T1 122 non-null object 64 LTE_SUM_T1 122 non-null object 65 LDI_SUM_T1 122 non-null object 66 LTE_SUM_T2 122 non-null object 67 LDI_SUM_T2 122 non-null object 68 NSES_YEAR 122 non-null object 69 NSES 122 non-null object 70 MENTAL_DISORDER_T1 122 non-null object 71 MENTAL_DISORDER_T2 122 non-null object dtypes: int64(3), object(69) memory usage: 68.8+ KB
f"The number of entries is {len(age)}"
'The number of entries is 122'
# Replace comma's with dots for the averages
age = age.replace(',', '.',regex=True)
# Replace the whitespaces with NaN values
age = age.replace(r'\s+', np.nan, regex=True)
# Check dataframe
age.tail()
| AGE | GENDER | GROUP_SIZE_CAT | AGE_T1 | AGE_T2 | BMI_T1 | WEIGHT_T1 | HIP_T1 | WAIST_T1 | HEIGHT_T1 | ... | V_SUM_T1 | D_SUM_T1 | LTE_SUM_T1 | LDI_SUM_T1 | LTE_SUM_T2 | LDI_SUM_T2 | NSES_YEAR | NSES | MENTAL_DISORDER_T1 | MENTAL_DISORDER_T2 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 117 | 83 | 2 | 2 | 73.70 | 77.37 | 27.54 | 73.47 | 103.96 | 93.38 | 163.34 | ... | 21.00 | 27.33 | 1.09 | 0.72 | 0.77 | 0.61 | 2009.36 | -0.87 | 1.68 | 1.82 |
| 118 | 84 | 1 | 1 | 74.81 | 78.12 | 26.98 | 84.14 | 100.81 | 100.53 | 176.55 | ... | 23.00 | 31.00 | 0.87 | 0.64 | 0.83 | 0.43 | 2009.26 | -0.79 | 1.04 | 0.30 |
| 119 | 84 | 2 | 2 | 74.74 | 78.32 | 28.16 | 75.62 | 105.17 | 95.49 | 163.86 | ... | NaN | NaN | 1.19 | 0.88 | 0.87 | 0.59 | 2009.28 | -0.77 | 1.63 | 1.00 |
| 120 | 85 | 1 | 1 | 75.85 | 79.37 | 26.75 | 83.09 | 100.36 | 100.25 | 176.34 | ... | NaN | NaN | 1.01 | 0.86 | 0.91 | 0.54 | 2009.50 | -0.74 | 1.07 | 1.16 |
| 121 | 85 | 2 | 1 | 76.05 | 79.44 | 27.42 | 72.83 | 104.52 | 92.67 | 162.94 | ... | NaN | NaN | 1.15 | 0.83 | 0.88 | 0.69 | 2009.42 | -0.79 | 0.97 | 2.12 |
5 rows × 72 columns
# Check the total number of missing values
f"The total number of missing values are {age.isnull().sum().sum()}"
'The total number of missing values are 116'
age.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 122 entries, 0 to 121 Data columns (total 72 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 AGE 122 non-null int64 1 GENDER 122 non-null int64 2 GROUP_SIZE_CAT 122 non-null int64 3 AGE_T1 122 non-null object 4 AGE_T2 122 non-null object 5 BMI_T1 122 non-null object 6 WEIGHT_T1 122 non-null object 7 HIP_T1 122 non-null object 8 WAIST_T1 122 non-null object 9 HEIGHT_T1 122 non-null object 10 BMI_T2 122 non-null object 11 WEIGHT_T2 122 non-null object 12 HIP_T2 122 non-null object 13 WAIST_T2 122 non-null object 14 HEIGHT_T2 122 non-null object 15 EDUCATION_LOWER_T1 122 non-null object 16 EDUCATION_LOWER_T2 122 non-null object 17 WORK_T1 122 non-null object 18 WORK_T2 122 non-null object 19 LOW_QUALITY_OF_LIFE_T1 122 non-null object 20 LOW_QUALITY_OF_LIFE_T2 122 non-null object 21 DBP_T1 122 non-null object 22 DBP_T2 122 non-null object 23 HBF_T1 122 non-null object 24 HBF_T2 122 non-null object 25 MAP_T1 122 non-null object 26 MAP_T2 122 non-null object 27 SBP_T1 122 non-null object 28 SBP_T2 122 non-null object 29 HTN_MED_T1 122 non-null object 30 CHO_T1 122 non-null object 31 GLU_T1 122 non-null object 32 CHO_T2 122 non-null object 33 GLU_T2 122 non-null object 34 RESPIRATORY_DISEASE_T1 122 non-null object 35 PACKYEARS 122 non-null object 36 SMOKING 122 non-null object 37 METABOLIC_DISORDER_T1 122 non-null object 38 METABOLIC_DISORDER_T2 122 non-null object 39 LLDS 122 non-null object 40 Quintile_LLDS 122 non-null object 41 ALCOHOL_INTAKE_T1 120 non-null object 42 KCAL_INTAKE_T1 120 non-null object 43 MWK_VAL 122 non-null object 44 SCOR_VAL 122 non-null object 45 MWK_NO_VAL 122 non-null object 46 SCOR_NO_VAL 122 non-null object 47 SPORTS_T1 122 non-null object 48 CYCLE_COMMUTE_T1 122 non-null object 49 VOLUNTEER_T1 122 non-null object 50 PREGNANCIES 61 non-null object 51 OSTEOARTHRITIS_T1 122 non-null object 52 SLEEP_QUALITY 95 non-null object 53 DIAG_CFS_CDC 122 non-null object 54 DIAG_FIBROMYALGIA_ACR 122 non-null object 55 DIAG_IBS_ROME3 122 non-null object 56 C_SUM_T1 119 non-null object 57 A_SUM_T1 119 non-null object 58 SC_SUM_T1 119 non-null object 59 I_SUM_T1 119 non-null object 60 E_SUM_T1 119 non-null object 61 SD_SUM_T1 119 non-null object 62 V_SUM_T1 119 non-null object 63 D_SUM_T1 119 non-null object 64 LTE_SUM_T1 122 non-null object 65 LDI_SUM_T1 122 non-null object 66 LTE_SUM_T2 122 non-null object 67 LDI_SUM_T2 122 non-null object 68 NSES_YEAR 122 non-null object 69 NSES 122 non-null object 70 MENTAL_DISORDER_T1 122 non-null object 71 MENTAL_DISORDER_T2 122 non-null object dtypes: int64(3), object(69) memory usage: 68.8+ KB
Personality traits:
( According to the toolkit: Only completed cases, as mean, of the personality questionaire are in the data set.)
Health and lifestyle:
Only women can get pregnant, so it make sense that there is data for only the females.
# Uniques and counts for the Age column
age.AGE.value_counts()
25 2
56 2
58 2
59 2
60 2
..
50 2
51 2
52 2
53 2
85 2
Name: AGE, Length: 61, dtype: int64
Age categories range from 25 to 85 years old and are sperated in male and female
# Change the dtype of the data frame to floats
age = age.astype(float)
# First three columns should stay integers
age = age.astype({'AGE':'int', 'GENDER':'int','GROUP_SIZE_CAT':'int'})
# Check if the changes worked
age.head()
| AGE | GENDER | GROUP_SIZE_CAT | AGE_T1 | AGE_T2 | BMI_T1 | WEIGHT_T1 | HIP_T1 | WAIST_T1 | HEIGHT_T1 | ... | V_SUM_T1 | D_SUM_T1 | LTE_SUM_T1 | LDI_SUM_T1 | LTE_SUM_T2 | LDI_SUM_T2 | NSES_YEAR | NSES | MENTAL_DISORDER_T1 | MENTAL_DISORDER_T2 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 25 | 1 | 2 | 17.15 | 20.00 | 21.53 | 72.20 | 89.85 | 78.03 | 182.96 | ... | 17.85 | 26.57 | 0.97 | 2.23 | 0.57 | 2.30 | 2010.0 | -0.67 | 0.33 | 1.07 |
| 1 | 25 | 2 | 4 | 17.20 | 19.96 | 22.27 | 64.91 | 91.80 | 75.34 | 170.67 | ... | 20.21 | 26.35 | 1.01 | 3.30 | 0.66 | 3.37 | 2010.0 | -0.67 | 1.40 | 1.87 |
| 2 | 26 | 1 | 3 | 17.89 | 21.06 | 22.15 | 74.92 | 91.49 | 79.89 | 183.78 | ... | 17.96 | 26.16 | 1.00 | 2.32 | 0.51 | 2.46 | 2010.0 | -0.66 | 0.41 | 2.11 |
| 3 | 26 | 2 | 5 | 18.08 | 21.07 | 22.57 | 65.84 | 92.87 | 76.39 | 170.78 | ... | 20.49 | 26.63 | 1.12 | 3.43 | 0.81 | 3.72 | 2010.0 | -0.69 | 1.60 | 3.20 |
| 4 | 27 | 1 | 3 | 18.74 | 21.84 | 22.57 | 75.66 | 92.13 | 80.82 | 182.94 | ... | 18.38 | 26.37 | 1.03 | 2.51 | 0.79 | 2.41 | 2010.0 | -0.86 | 1.11 | 1.76 |
5 rows × 72 columns
# Descriptive stats of data frame
age.describe().T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| AGE | 122.0 | 55.000000 | 17.679423 | 25.00 | 40.0000 | 55.000 | 70.0000 | 85.00 |
| GENDER | 122.0 | 1.500000 | 0.502062 | 1.00 | 1.0000 | 1.500 | 2.0000 | 2.00 |
| GROUP_SIZE_CAT | 122.0 | 9.811475 | 4.627142 | 1.00 | 6.0000 | 10.500 | 15.0000 | 15.00 |
| AGE_T1 | 122.0 | 45.954918 | 17.519626 | 17.15 | 30.8500 | 45.750 | 61.0775 | 76.05 |
| AGE_T2 | 122.0 | 49.544016 | 17.547281 | 19.96 | 34.5850 | 49.460 | 64.5725 | 79.44 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| LDI_SUM_T2 | 122.0 | 1.922131 | 0.928080 | 0.43 | 0.8925 | 2.185 | 2.5675 | 3.72 |
| NSES_YEAR | 122.0 | 2009.511148 | 0.239190 | 2008.66 | 2009.4000 | 2009.495 | 2009.6375 | 2010.00 |
| NSES | 122.0 | -0.676475 | 0.150640 | -1.06 | -0.7475 | -0.640 | -0.5625 | -0.42 |
| MENTAL_DISORDER_T1 | 122.0 | 1.607541 | 0.568474 | 0.33 | 1.1800 | 1.515 | 2.1300 | 2.60 |
| MENTAL_DISORDER_T2 | 122.0 | 1.838361 | 0.812315 | 0.19 | 1.1700 | 1.865 | 2.4100 | 3.77 |
72 rows × 8 columns
Select the columns that might be relevant for the research question
age_new = age[['AGE','EDUCATION_LOWER_T1', 'OSTEOARTHRITIS_T1','GROUP_SIZE_CAT', 'GENDER']]
f"The total number of missing values are {age_new.isnull().sum().sum()}"
'The total number of missing values are 0'
The columns that seems relevant and are selected don't have any data missing, so no interpolation is needed.
# Make a copy of the newly created dataframe
df = age_new.copy()
According to the toolkit of lifelines:
Gender is:
# Change the values of the gender column to their corresponding name
df['GENDER'] = df['GENDER'].replace([1, 2], ['Male', 'Female'])
df.head()
| AGE | EDUCATION_LOWER_T1 | OSTEOARTHRITIS_T1 | GROUP_SIZE_CAT | GENDER | |
|---|---|---|---|---|---|
| 0 | 25 | 57.03 | 0.0 | 2 | Male |
| 1 | 25 | 53.23 | 0.0 | 4 | Female |
| 2 | 26 | 46.94 | 0.0 | 3 | Male |
| 3 | 26 | 42.86 | 0.0 | 5 | Female |
| 4 | 27 | 37.11 | 0.0 | 3 | Male |
See if there correlations amongs the variables:
df.corr().abs()
/tmp/ipykernel_762813/3426995566.py:1: FutureWarning: The default value of numeric_only in DataFrame.corr is deprecated. In a future version, it will default to False. Select only valid columns or specify the value of numeric_only to silence this warning. df.corr().abs()
| AGE | EDUCATION_LOWER_T1 | OSTEOARTHRITIS_T1 | GROUP_SIZE_CAT | |
|---|---|---|---|---|
| AGE | 1.000000 | 0.792093 | 0.857631 | 0.241149 |
| EDUCATION_LOWER_T1 | 0.792093 | 1.000000 | 0.878771 | 0.515433 |
| OSTEOARTHRITIS_T1 | 0.857631 | 0.878771 | 1.000000 | 0.348120 |
| GROUP_SIZE_CAT | 0.241149 | 0.515433 | 0.348120 | 1.000000 |
#correlation matrix
corr = df.corr().abs()
sns.heatmap(corr, annot=True)
/tmp/ipykernel_762813/32526792.py:2: FutureWarning: The default value of numeric_only in DataFrame.corr is deprecated. In a future version, it will default to False. Select only valid columns or specify the value of numeric_only to silence this warning. corr = df.corr().abs()
<AxesSubplot: >
c = df.corr().abs()
# Set the x and y range for the heatmap
y_range = (list(reversed(c.columns)))
x_range = (list(c.index))
c
/tmp/ipykernel_762813/4163679710.py:1: FutureWarning: The default value of numeric_only in DataFrame.corr is deprecated. In a future version, it will default to False. Select only valid columns or specify the value of numeric_only to silence this warning. c = df.corr().abs()
| AGE | EDUCATION_LOWER_T1 | OSTEOARTHRITIS_T1 | GROUP_SIZE_CAT | |
|---|---|---|---|---|
| AGE | 1.000000 | 0.792093 | 0.857631 | 0.241149 |
| EDUCATION_LOWER_T1 | 0.792093 | 1.000000 | 0.878771 | 0.515433 |
| OSTEOARTHRITIS_T1 | 0.857631 | 0.878771 | 1.000000 | 0.348120 |
| GROUP_SIZE_CAT | 0.241149 | 0.515433 | 0.348120 | 1.000000 |
# Reshape
dfc = pd.DataFrame(c.stack(), columns=['r']).reset_index()
dfc.head()
| level_0 | level_1 | r | |
|---|---|---|---|
| 0 | AGE | AGE | 1.000000 |
| 1 | AGE | EDUCATION_LOWER_T1 | 0.792093 |
| 2 | AGE | OSTEOARTHRITIS_T1 | 0.857631 |
| 3 | AGE | GROUP_SIZE_CAT | 0.241149 |
| 4 | EDUCATION_LOWER_T1 | AGE | 0.792093 |
# Heatmap based on the heatmap example from the gitbook
def heatmap():
#create colormapper
mapper = LinearColorMapper(palette=Viridis256, low=dfc.r.min(), high=dfc.r.max())
#create plot
p = figure(title="Correlation heatmap life lines data", plot_width=500, plot_height=450,
x_range=x_range, y_range=y_range, x_axis_location="above", toolbar_location=None)
source = ColumnDataSource(dfc)
text_props = dict(source=source, text_align="left", text_baseline="middle")
#use mapper to fill the rectangles in the plot
p.rect(x="level_0", y="level_1", width=1, height=1, source=source,
line_color=None, fill_color=transform('r', mapper))
# Add the values of r to the rectangles
# Round the numbers to fit the box
dfc['rtest'] = round(dfc['r'],3)
# Text needs to be string
dfc['rfinal'] = dfc['rtest'].astype(str)
# Add the text to the boxes (from bokeh documentation: categorical plots)
text_props = dict(source=dfc, text_align="left", text_baseline="middle")
x = dodge("level_0", -0.4, range=p.x_range)
p.text(x=x, y="level_1", text="rfinal", text_font_style="bold", **text_props)
#create and add colorbar to the right
color_bar = ColorBar(color_mapper=mapper, location=(0, 0),
ticker=BasicTicker(desired_num_ticks=len(x_range)),
formatter=PrintfTickFormatter(format="%.1f"))
p.add_layout(color_bar, 'right')
#draw axis
p.axis.axis_line_color = None
p.axis.major_tick_line_color = None
p.axis.major_label_text_font_size = "10px"
p.axis.major_label_standoff = 0
p.xaxis.major_label_orientation = 1.0
show(p)
heatmap()
There seems to be a strong positive relation between lower education and osteoarthritis (0.879).
Also, there seems to be strong positive relations between lower education and age (0.792) and even more for osteoarthritis and age (0.858).
A QQ-plot is used to visually determine how close a sample is to a the Normal distribution. If the points fall roughly on the diagonal line, then the samples can be considered to be distributed normal.
# Q_Q plot of lower education (statsmodels)
fig = sm.qqplot(df.EDUCATION_LOWER_T1, fit = True, line = '45')
plt.show()
# Q_Q plot Lower education (DS1 bayesian statistics)
DS_Q_Q_Plot(df.EDUCATION_LOWER_T1)
Estimation method: robust n = 122, mu = 32.98, sigma = 20.74 Expected number of data outside CI: 6
With the Bayesian DS1 Q_Q plot you can see that the data falls largely within the 95% confidence interval.
Some points fall outside, but still with the expected number.
The data at the end bends from the diagonal line, so the data from education might not be normally distributed.
# Q_Q plot of Osteoarthritis (statsmodels)
fig = sm.qqplot(df.OSTEOARTHRITIS_T1, fit = True, line = '45')
plt.show()
# Q_Q plot Osteoarthrithis
DS_Q_Q_Plot(df.OSTEOARTHRITIS_T1)
Estimation method: robust n = 122, mu = 4.95, sigma = 9.85 Expected number of data outside CI: 6
Most of the data falls outside of the 95% CI and is definitely not linear (more sigmoidal)
The data of osteoarthritis does not look normally distributed.
Evenly distributed (uniform) data yield a sigmoidal Q-Q plot. The Q_Q plots seem to be sigmoidal, so data might be uniform. Check this with histogram.
Before assessing the distribution statistically, plot the distribution first.
histogram = df.hvplot.hist('EDUCATION_LOWER_T1', ylabel='counts', alpha = 0.4,
title='Histogram lower education percentage').opts(toolbar=None)
histogram
# Histogram of education by gender
histogram = df.hvplot.hist('OSTEOARTHRITIS_T1',ylabel='counts', alpha = 0.4,
title='Histogram osteoarthritis percentage').opts(toolbar=None)
histogram
print('Education')
DS_Q_Q_Hist(df.EDUCATION_LOWER_T1)
print('Osteoarthritis')
DS_Q_Q_Hist(df.OSTEOARTHRITIS_T1)
Education Estimation method: robust mu = 32.98, sigma = 20.74
Osteoarthritis Estimation method: robust mu = 4.95, sigma = 9.85
For both types of histograms, lower education and osteoarthritis do not seem to be normally distributed.
A density plot predicts the count value and smoothens across the x-axis. They're not affected by bins, so they are better at determining the distribution shape
# Density plot education by gender
df.hvplot.kde('EDUCATION_LOWER_T1', alpha=0.5, title='Density plot of education level')
# Density plot education by gender
df.hvplot.kde('OSTEOARTHRITIS_T1', alpha=0.5, title='Density plot of Osteoarthritis')
stat, p = shapiro(df['EDUCATION_LOWER_T1'])
print('Lower education H0: The data is normally distributed')
print('Statistics=%.3f, p=%.3f' % (stat, p))
# interpret
alpha = 0.05
if p > alpha:
print('Data looks Gaussian (fail to reject H0)')
else:
print('Data does not look Gaussian (reject H0)')
Lower education H0: The data is normally distributed Statistics=0.957, p=0.001 Data does not look Gaussian (reject H0)
stat, p = shapiro(df['OSTEOARTHRITIS_T1'])
print('Osteoarthritis H0: The data is normally distributed')
print('Statistics=%.3f, p=%.10f' % (stat, p))
# interpret
alpha = 0.05
if p > alpha:
print('Sample looks Gaussian (fail to reject H0)')
else:
print('Sample does not look Gaussian (reject H0)')
Osteoarthritis H0: The data is normally distributed Statistics=0.835, p=0.0000000002 Sample does not look Gaussian (reject H0)
The data is definitely not normally distributed.
When the data is not normally distributed, the type of analysis changes. Linear models, like t-test, ANOVA and linear regression go by the assumption that the data is normally distributed.
An option is to analyze the data by using nonparametric tests, which do not make any assumption about the underlying distribution of the data.
from scipy.stats import spearmanr
rho, p = spearmanr(df['EDUCATION_LOWER_T1'],df['OSTEOARTHRITIS_T1'])
print('rho=%.3f, p=%.3f' % (rho,p))
alpha = 0.05
# Type of correlation
if rho > 0:
print('There is a positive correlation between lower education and osteoarthritis')
else:
print('There is a negative correlation between lower education and osteoarthritis')
# Significance
if p > alpha:
print('The relationship between lower education and osteoarthritis is statistically not significantly')
else:
print('The relationship between lower education and osteoarthritis is statistically significant')
rho=0.807, p=0.000 There is a positive correlation between lower education and osteoarthritis The relationship between lower education and osteoarthritis is statistically significant
There is a significant positive correlation between lower education and oseoarthritis.
rho, p = spearmanr(df['AGE'],df['OSTEOARTHRITIS_T1'])
print('rho=%.3f, p=%.3f' % (rho,p))
alpha = 0.05
# Type of correlation
if rho > 0:
print('There is a positive correlation between age and osteoarthritis')
else:
print('There is a negative correlation between age and osteoarthritis')
# Significance
if p > alpha:
print('The relationship between age and osteoarthritis is statistically not significantly')
else:
print('The relationship between age and osteoarthritis is statistically significant')
rho=0.950, p=0.000 There is a positive correlation between age and osteoarthritis The relationship between age and osteoarthritis is statistically significant
rho, p = spearmanr(df['AGE'],df['EDUCATION_LOWER_T1'])
print('rho=%.3f, p=%.3f' % (rho,p))
alpha = 0.05
# Type of correlation
if rho > 0:
print('There is a positive correlation between age and osteoarthritis')
else:
print('There is a negative correlation between age and osteoarthritis')
# Significance
if p > alpha:
print('The relationship between age and osteoarthritis is statistically not significantly')
else:
print('The relationship between age and osteoarthritis is statistically significant')
rho=0.797, p=0.000 There is a positive correlation between age and osteoarthritis The relationship between age and osteoarthritis is statistically significant
The values of the correlations change based on the type of test used:
# Default correlation test
print('Default settings')
df.corr().abs()
Default settings
/tmp/ipykernel_762813/2806251345.py:3: FutureWarning: The default value of numeric_only in DataFrame.corr is deprecated. In a future version, it will default to False. Select only valid columns or specify the value of numeric_only to silence this warning. df.corr().abs()
| AGE | EDUCATION_LOWER_T1 | OSTEOARTHRITIS_T1 | GROUP_SIZE_CAT | |
|---|---|---|---|---|
| AGE | 1.000000 | 0.792093 | 0.857631 | 0.241149 |
| EDUCATION_LOWER_T1 | 0.792093 | 1.000000 | 0.878771 | 0.515433 |
| OSTEOARTHRITIS_T1 | 0.857631 | 0.878771 | 1.000000 | 0.348120 |
| GROUP_SIZE_CAT | 0.241149 | 0.515433 | 0.348120 | 1.000000 |
# Change method to spearman
print('Spearman test')
df.corr(method='spearman').abs()
Spearman test
/tmp/ipykernel_762813/2667412546.py:3: FutureWarning: The default value of numeric_only in DataFrame.corr is deprecated. In a future version, it will default to False. Select only valid columns or specify the value of numeric_only to silence this warning. df.corr(method='spearman').abs()
| AGE | EDUCATION_LOWER_T1 | OSTEOARTHRITIS_T1 | GROUP_SIZE_CAT | |
|---|---|---|---|---|
| AGE | 1.000000 | 0.796946 | 0.950445 | 0.227519 |
| EDUCATION_LOWER_T1 | 0.796946 | 1.000000 | 0.807291 | 0.463873 |
| OSTEOARTHRITIS_T1 | 0.950445 | 0.807291 | 1.000000 | 0.155002 |
| GROUP_SIZE_CAT | 0.227519 | 0.463873 | 0.155002 | 1.000000 |
The heatmap would look different too:
c = df.corr(method='spearman').abs()
y_range = (list(reversed(c.columns)))
x_range = (list(c.index))
dfc = pd.DataFrame(c.stack(), columns=['r']).reset_index()
heatmap()
/tmp/ipykernel_762813/2418570910.py:1: FutureWarning: The default value of numeric_only in DataFrame.corr is deprecated. In a future version, it will default to False. Select only valid columns or specify the value of numeric_only to silence this warning. c = df.corr(method='spearman').abs()
Based on the Spearman test:
# Accesible Palettes: Bright colors
# https://personal.sron.nl/~pault/
palette = ['#EE6677','#66CCEE']
# Define factors for plots
GENDER = sorted(df.GENDER.unique())
MARKERS = ['circle', 'hex']
def edu_osteo():
# Create Figure
p = figure(title = "Relationship lower education level and osteoarthritis", toolbar_location=None,
plot_width = 700, plot_height = 700)
# Make plot
p.scatter("EDUCATION_LOWER_T1", "OSTEOARTHRITIS_T1", source=df,
legend_group="GENDER", fill_alpha=0.4, size=15, line_width=1.5,
marker=factor_mark('GENDER', MARKERS, GENDER),
color=factor_cmap('GENDER', palette, GENDER))
#Legend
p.legend.location = "top_left"
#Axis labels
p.xaxis.axis_label = 'Percentage osteoarthritis (%)'
p.yaxis.axis_label = 'Percentage lower education level (%)'
# Hovertool
hover = HoverTool(tooltips=[('Education','@EDUCATION_LOWER_T1{0,0.00}%'),
("Osteoarthritis", "@OSTEOARTHRITIS_T1{0,0.00}%")])
p.add_tools(hover)
return p
def age_osteo():
# Create Figure
p = figure(title = "Percentage osteoarthritis by age",toolbar_location=None,
plot_width = 700, plot_height = 700)
# Make plot
r = p.scatter("AGE", "OSTEOARTHRITIS_T1", source=df,
legend_group="GENDER", fill_alpha=0.4, size=15, line_width=1.5,
marker=factor_mark('GENDER', MARKERS, GENDER),
color=factor_cmap('GENDER', palette, GENDER))
#Legend
p.legend.location = "top_left"
# Axis
p.yaxis.axis_label = 'Percentage osteoarthritis (%)'
p.xaxis.axis_label = 'Age in years'
# Hovertool
hover = HoverTool(tooltips=[('Age','@AGE'),
("Osteoarthritis", "@OSTEOARTHRITIS_T1{0,0.00}%")])
p.add_tools(hover)
return p
def age_edu():
# Create Figure
p = figure(title = "Percentage lower education level by age", toolbar_location=None,
plot_width = 700, plot_height = 700)
# Make plot
p.scatter("AGE", "EDUCATION_LOWER_T1", source=df,
legend_group="GENDER", fill_alpha=0.4, size=15,line_width=1.5,
marker=factor_mark('GENDER', MARKERS, GENDER),
color=factor_cmap('GENDER', palette, GENDER))
#Legend
p.legend.location = "top_left"
#Axis labels
p.yaxis.axis_label = 'Percentage lower education (%)'
p.xaxis.axis_label = 'Age in years'
# Hovertool to see the values in the plot
hover = HoverTool(tooltips=[('Age','@AGE'),
("Lower education", "@EDUCATION_LOWER_T1{0,0.00}%")])
p.add_tools(hover)
return p
# Create tabs to switch between plots
tabs = pn.Tabs()
tabs.append(('Lower education - Osteoarthritis', edu_osteo))
tabs.append(('Age - Osteoarthritis', age_osteo))
tabs.append(('Age - Lower education', age_edu))
tabs
Looking at the plots, the relation between lower education and osteoarthritis seems to follow a different pattern for men and women, but is higher for women. Women seem to have higher percentage of lower education and osteoarthritis than men. The pattern/trend of comparing age to osteoarthritis for men and women looks similar to the relation between education and osteoarthritis. How older someone is the higher the percentage of osteoarthritis.
When comparing age to lower education level you can also observe an positive relation. The higher the age, the higher the percentage of lower education.
Women aged 65-85 seem to have higher percentages of osteoarthritis then the men in the same age categories. Around 65 years old you see the women surpassing the men in percentage lower education level as well. Women are more prone to developing osteoarthritis than men, especially after the age of 50 (possibly due to menopause).
https://www.arthritis-health.com/blog/why-are-women-more-prone-osteoarthritis
https://www.niams.nih.gov/health-topics/osteoarthritis
Joint overuse such as knee bending and repetitive stress on a joint, can damage a joint and increase the risk of Osteoarthritis in that joint. It is possible that due to having an lower education people people have more jobs related to manual labor and occupations involving manual labor might increase the chance of having Osteoarthritis. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4562436/
There are seem to be a few outliers when you compare for all the participants in the dataset.
There are a few age categories that have high lower education level, but low osteoarthritis. When looking at age to education level you can see that for the categories 25-29 there are higher percentages in lower education(and do not follow the trend). The low lying (outlier) data points in the relation between osteoarthritis and education level could very well be these younger age categories, because osteoarthritis is more common as people age.
It is interesting to see that the current 25-29 age groups seem to have higher lower education percentages compared to what you see for the ages 30 to 85. There are various reasons and factors as why this could be the case. One possible explanation could be that young people take some time in deciding what to study and go for higher education later in life. Another explanation could be that younger people with higher educaions have moved away and do not live in these regions anymore or have simply not participated in this study.
In short:
The research question was: Is there a relation between lower education level and osteoarthritis?
The results show that there is a positive significant relation between lower education and osteoarthritis.